library(tidyverse)
library(infer)
library(mosaic)
library(caret)
library(ggplot2)
library(mdsr)
library(rpart)
library(rpart.plot)
library(partykit)
library(randomForest)
library(pROC)
library(gbm)
library(Metrics)
library(viridis)
require(maps)
require(viridis)
theme_set(
  theme_void()
  )
library(knitr)
library(tmap)
library(ggpubr)

The Premise

Hello! This project is the final for my STAT228: Introduction to Data Science course at Simmons University. With this project, I’d like to encompass everything I’ve learned throughout the semester, as well as some additional data science methods that I have taught myself outside of class. (The primary non-curriculum methods I am using come from the package tmap, which I was made aware of from a LinkedIn post by Joachim Schork, a data science educator & consultant from Germany). The premise of my project is to predict & analyze Women’s Empowerment Index scores for countries ; in this project, I aim to find the best version of the model predicting WEI scores using a variety of ensemble methods. There are two datasets I’m interested in using here:

I’d like to join the two datasets by the common variable “Country”, and analyze WEI scores by health factors related to the patient’s country. I am interested in creating several maps that will visualize WEI scores against other health-based factors.

Cleaning the Data

I want to start by filtering led to only include data where the year = 2015, because this is the most recent year within this dataset.

led_2015 <-     led%>%
    filter(Year=="2015")

Now, I want to rename several datapoints within the led_2015 dataset because I want the data names to be congruent between led_2015 and wei. This will be very tedious, but it is necessary to be able to join the datasets!

led_2015[led_2015$Country == "Bolivia(PlurinationalStateof)", "Country"] <- "Bolivia"

wei[wei$Country == "Bolivia(Plurinational State of)", "Country"] <- "Bolivia"

led_2015[led_2015$Country == "BosniaandHerzegovina", "Country"] <- "Bosnia and Herzegovina"

led_2015[led_2015$Country == "BurkinaFaso", "Country"] <- "Burkina Faso"

wei[wei$Country == "Congo (Democratic Republic of the)", "Country"] <- "DemocraticRepublicoftheCongo"

led_2015[led_2015$Country == "CostaRica", "Country"] <- "Costa Rica"

led_2015[led_2015$Country == "DominicanRepublic", "Country"] <- "Dominican Republic"

led_2015[led_2015$Country == "ElSalvador", "Country"] <- "El Salvador"

led_2015[led_2015$Country == "Iran(IslamicRepublicof)", "Country"] <- "Iran (Islamic Republic of)"

led_2015[led_2015$Country == "LaoPeople'sDemocraticRepublic", "Country"] <- "Laos"

wei[wei$Country == "Lao People's Democratic Republic", "Country"] <- "Laos"

led_2015[led_2015$Country == "RepublicofMoldova", "Country"] <- "Moldova (Republic of)"

led_2015[led_2015$Country == "SierraLeone", "Country"] <- "Sierra Leone"

led_2015[led_2015$Country == "SouthAfrica", "Country"] <- "South Africa"

led_2015[led_2015$Country == "SriLanka", "Country"] <- "Sri Lanka"

led_2015[led_2015$Country == "TheformerYugoslavrepublicofMacedonia", "Country"] <- "North Macedonia"

led_2015[led_2015$Country == "UnitedRepublicofTanzania", "Country"] <- "Tanzania (United Republic of)"

wei[wei$Country == "Türkiye", "Country"] <- "Turkey"

led_2015[led_2015$Country == "UnitedArabEmirates", "Country"] <- "United Arab Emirates"

led_2015[led_2015$Country == "UnitedKingdomofGreatBritainandNorthernIreland", "Country"] <- "United Kingdom"

led_2015[led_2015$Country == "UnitedStatesofAmerica", "Country"] <- "United States"

led_2015[led_2015$Country == "VietNam", "Country"] <- "Viet Nam"

That took a WHILE! But, now my two datasets should be better matched, and it’ll be much easier to join them!

Now I’m going to remove some unnecessary variables from the datasets This is because some variables have too many missing datapoints to be useful, or because some variables may be redundant.

led_2015 = subset(led_2015, select = -c(Alcohol, Totalexpenditure,Year))

Now, I will be joining the two datasets using inner_join, because I only want the countries in common between both dataframes. I’m also going to create a second version that is full_joined, just in case I end up needing it!

dataset <- wei%>% inner_join(led_2015,by="Country")

dataset2 <- wei%>% full_join(led_2015,by="Country")
dim(dataset)
## [1] 112  25

Finally, I will be renaming a few variables because I dislike having spaces in my variable names. After this, I will have finished cleaning the data! I’m also removing a few more variables just because they seem rather redundant - for example, the Gender Parity Group and Gender Parity Index variables may be too similar to WEI, therefore they might have skewed levels of correlation.

dataset <- dataset %>%
  rename(WEI = "Women's Empowerment Index (WEI) - 2022")

dataset <- dataset %>%
  rename(WEG = "Women's Empowerment Group - 2022")

dataset <- dataset %>%
  rename(GGPI = "Global Gender Parity Index (GGPI) - 2022")

dataset <- dataset %>%
  rename(HDG = "Human Development Group - 2021")

dataset <- dataset %>%
  rename(GPG = "Gender Parity Group - 2022")

dataset <- dataset %>%
  rename(SDG_Region = "Sustainable Development Goal regions")
dataset = subset(dataset, select = -c(GPG, SDG_Region,GGPI))

All finished cleaning! Now I want to take a peek at the dataset I’ll be working with before I go into the next steps.

dim(dataset)
## [1] 112  22
summary(dataset$WEI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1410  0.5180  0.6150  0.6085  0.7070  0.8280
tally(dataset$Status)
## X
##  Developed Developing 
##         31         81

Intial Visualization

Now, I want to create a map of WEI by country, just to get a general visual of how the distribution looks.

I’m going to start out by creating a joined dataset between dataset and World, so I can have a mappable dataset using the tmap package.

library(tmap)
data("World")

dataset<- dataset %>%
  rename(name = Country)

wei.map <- inner_join(dataset,World,by="name")

Unfortunately, the wei.map only contains 98 individuals despite dataset containing 112 individuals! This means that 14 individuals have disharmonious names between dataset and World. I’m going to run some code renaming a whole bunch of countries, just like I did above with the led_2015 dataset. I’m going to hide this code, just because it’s super tedious work. Tidying & cleaning your data is a constant process!

dataset[dataset$name == "Bosnia and Herzegovina", "name"] <- "Bosnia and Herz."

dataset[dataset$name == "DemocraticRepublicoftheCongo", "name"] <- "Dem. Rep. Congo"

dataset[dataset$name == "Dominican Republic", "name"] <- "Dominican Rep."

dataset[dataset$name == "Iran (Islamic Republic of)", "name"] <- "Iran"

dataset[dataset$name == "Laos", "name"] <- "Lao PDR"

dataset[dataset$name == "North Macedonia", "name"] <- "Macedonia"

dataset[dataset$name == "Moldova (Republic of)", "name"] <- "Moldova"

dataset[dataset$name == "Tanzania (United Republic of)", "name"] <- "Tanzania"

dataset[dataset$name == "Viet Nam", "name"] <- "Vietnam"

Now, let’s try re-joining these datasets!

wei.map <- right_join(dataset,World,by="name")
library(s2)
library(sf)
## Warning: package 'sf' was built under R version 4.3.2
wei.map.sf  <- wei.map %>% 
  st_as_sf()

And visualizing:

tmap_mode("plot")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

Lovely! I also made an interactive version of the map, just for further exploration of the tmap package.

tmap_mode("view")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

Now, I want to move onto creating models that will help me predict the WEI scores of countries based on other factors.

Decision Tree

I’m going to start by creating a decision tree for my dataset. I believe this will be beneficial to see, seeing as the decision tree will select the best predictors for my response variable. The first step here is to split my data into training and testing data.

set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]

I’m going to look at the proportion of developed vs developing countries within both my test and train data. This is so I can ensure that test is at least somewhat representative of the distribution within train.

tally(test$Status, format='proportion')
## X
##  Developed Developing 
##        0.3        0.7
tally(train$Status, format = 'proportion')
## X
##  Developed Developing 
##  0.2717391  0.7282609
tally(dataset$Status, format = 'proportion')
## X
##  Developed Developing 
##  0.2767857  0.7232143

Both test and train appear to be adequately representative of dataset!

Now, I will create my decision tree. I will fit the decision tree to train, and I will set nearly every predictor variable (BESIDES NAME & WEG, for reasons of redundancy) as my X. I am doing this so that the decision tree can tell me which predictor variables have the highest influence on my response variable, WEI.

tree <- rpart(WEI~ HDG + Status + Lifeexpectancy + AdultMortality + infantdeaths + HepatitisB + Measles +  Polio + Diphtheria + Population + Schooling, data=train, cp=.01)

rpart.plot(tree)

It appears there are three predictors that have the highest correlation with WEI scores : HDG (Human Development Group), Years of Schooling, and Adult Mortality Rates. For further exploration, I’m going to create a Random Forest to find our best model.

Random Forest & RMSE

I’m going to have to not use any variables that have “NA” values to be able to properly crearte a random forest. This unfortunately means removing Schooling. However, I don’t want to remove Schooling because according to the decision tree, it’s one of the most important predictors! So, I am going to use regression imputation to create some estimated Schooling values for any of our countries that have Schooling = NA.

missing <- is.na(dataset$Schooling) 

sum(missing)
## [1] 6
which(missing)
## [1]  11  30  31  47  72 105
lm(Schooling~WEI, data=dataset)
## 
## Call:
## lm(formula = Schooling ~ WEI, data = dataset)
## 
## Coefficients:
## (Intercept)          WEI  
##       2.574       17.943
dataset$Schooling[11]<-2.574+17.943*(0.707)
dataset$Schooling[30]<-2.574+17.943*(0.778)
dataset$Schooling[31]<-2.574+17.943*(0.752)
dataset$Schooling[47]<-2.574+17.943*(0.686)
dataset$Schooling[72]<-2.574+17.943*(0.399)
dataset$Schooling[105]<-2.574+17.943*(0.510)

Yet another example of never being done with cleaning the data! Now, I have to recreate my training and testing dataset without these missing values.

set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]

Then, I can finally create my Random Forest

set.seed(228)
rf_model <-randomForest(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria ,data=train, proximity=TRUE, ntree=1000)
rf_model
## 
## Call:
##  randomForest(formula = WEI ~ HDG + Status + Lifeexpectancy +      Schooling + AdultMortality + infantdeaths + Measles + Polio +      Diphtheria, data = train, proximity = TRUE, ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.005114937
##                     % Var explained: 72

The mean of squared residuals is pretty close to 0, which means the model is good! However, I’m going to also try running a Random Forest model in which I only use the three best predictor variables as identified by my Decision Tree. I will then compare the two mean of squared residuals.

set.seed(228)
rf_model2 <-randomForest(WEI ~ HDG +  Schooling + AdultMortality  ,data=train, proximity=TRUE, ntree=1000)
rf_model2
## 
## Call:
##  randomForest(formula = WEI ~ HDG + Schooling + AdultMortality,      data = train, proximity = TRUE, ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.005898256
##                     % Var explained: 67.72

The mean of squared residuals is ever so slightly higher for this model, and the Var explained is slightly lower. I’m going to use an RMSE test to compare the two models.

Now, I’m going to boost the two models. Before doing so, I am changing HDG’s class to ordered, and Status’s class to ordered, so they arecompatible with the boosted model.

train$HDG <- as.ordered(train$HDG)
train$Status <- as.ordered(train$Status)
boost1 <- gbm(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria, data=train,distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)
boost2 <- gbm(WEI~ Schooling + HDG + AdultMortality , data = train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)

Now I will evaluate the two models using predictions & RMSE.

predictions1 <- predict(boost1,newdata=test)
rmse(actual=test$WEI, predicted=predictions1)
## [1] 0.0567137
predictions2 <- predict(boost2, newdata = test)
    rmse(actual=test$WEI, predicted=predictions2)
## [1] 0.05870663

Once again, the model with nearly very variable has proved to perform slightly better than the model with only 3 predictor variables. However, this difference is so small, and I think it is worth it to sacrifice this small difference for the sake of keeping our model SIMPLE! The model still performs very well. So, I’m deciding to keep the model that only contains the three best predictor variables!

Final Visualizations

wei.map <- right_join(dataset,World,by="name")

wei.map.sf  <- wei.map %>% 
  st_as_sf()

I’m going to create FOUR maps here: One exactly like my first visualization that shows WEI scores by country, and then three that shows the distribution of each of my predictor variables! I will display all of these maps side by side.

tmap_mode("plot")

 tm_shape(wei.map.sf) + tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("HDG",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Human Development Groups", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("Schooling",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Years of Schooling", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("AdultMortality",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Adult Mortality per Population of 1000", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

These are sort of confusing. For one, the HDG categories are out of order, which is very hard to read. Secondly, the adult mortality map should probably have the colors flipped, because lower adult mortality corresponds to higher WEI.

I’m going to fix them, and insert the png below.

knitr::include_graphics("/Users/jasmineday/Downloads/WEIPLOT.jpg")

And there we have our final visualization! I hope you enjoyed reading through this project as much as I enjoyed my STAT228 class :)

BONUS: Testing the Prediction Model

After finishing the first draft of my project, I figured I wanted to add a section in which I take a small handful of countries that are NOT represented in the wei dataset and use our model to predict the WEI scores for these nations. I will be using the countries Kazakhstan, New Zealand, Sudan, Haiti, and Ukraine. I will use led_2015 to source the HDG, Schooling, and AdultMortality for each of these countries.

finalmodel <- lm(WEI~HDG+Schooling+AdultMortality,data=dataset)
finalmodel
## 
## Call:
## lm(formula = WEI ~ HDG + Schooling + AdultMortality, data = dataset)
## 
## Coefficients:
##    (Intercept)          HDGLow       HDGMedium    HDGVery high       Schooling  
##      0.1862461      -0.0749196      -0.0202205       0.0717149       0.0282162  
## AdultMortality  
##      0.0002026
#Kazakhstan:
  #HDG=Very high
  #Schooling=15
  #AdultMortality=198
Kazakhstan_WEI <- .0186 + .0717 + .0282*15 + .0002*198
#New Zealand  
  #HDG=Very high
  #Schooling=19.2
  #AdultMortality=66
NZ_WEI <- .0186 + .0717 + .0282*19.2 + .0002*66
#Sudan
  #HDG=Low
  #Schooling=7.2
  #AdultMortality=225
Sudan_WEI <- .0186 - .0749 + .0282*7.2 + .0002*225
#Haiti
  #HDG=Medium
  #Schooling=9.1
  #AdultMortality=24
Haiti_WEI <- .0186 - .0202 + .0282*9.1 + .0002*24
#Ukraine
  #HDG=High
  #Schooling=15.3
  #AdultMortality=195
Ukraine_WEI <- .0186 + .0282*15.3 + .0002*195

Kazakhstan imputed/predicted WEI score = .553 New Zealand imputed/predicted WEI score = .645 Sudan imputed/predicted WEI score = .192 Haiti imputed/predicted WEI score = .260 Ukraine imputed/predicted WEI score = .489

These don’t feel entirely accurate. For example, Sudan’s imputed WEI score is just above that of Yemen’s, which is the lowest in the world. However, it still can be beneficial to see where there may be potential flaws in using this linear model to predict WEI scores! WEI score data on all countries is not public, so I am unfortunately unable to check the actual accuracy of these predictions.

---
title: "Finding the Best Model for Women's Empowerment Index Scores"
author: "Jasmine Day"
date: "May 6th 2024"
output: 
  html_document:
    code_folding: hide
    theme: flatly
    toc: true
    toc_float: true
    code_download: true
---

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(infer)
library(mosaic)
library(caret)
library(ggplot2)
library(mdsr)
library(rpart)
library(rpart.plot)
library(partykit)
library(randomForest)
library(pROC)
library(gbm)
library(Metrics)
library(viridis)
require(maps)
require(viridis)
theme_set(
  theme_void()
  )
library(knitr)
library(tmap)
library(ggpubr)

```

```{r, echo = FALSE, warning = FALSE, message = FALSE}
library(readr)

wei <- read_csv("DATASETS/women_empowerment_index.csv")

led <- read_csv("DATASETS/led.csv")

```


# The Premise
Hello! This project is the final for my STAT228: Introduction to Data Science course at Simmons University. With this project, I'd like to encompass everything I've learned throughout the semester, as well as some additional data science methods that I have taught myself outside of class. (The primary non-curriculum methods I am using come from the package **tmap**, which I was made aware of from a LinkedIn post by Joachim Schork, a data science educator & consultant from Germany). The premise of my project is to predict & analyze Women's Empowerment Index scores for countries ; in this project, I aim to find the best version of the model predicting WEI scores using a variety of ensemble methods. There are two datasets I'm interested in using here: 

- 1. the *wei* dataset, which stands for Women's Empowerment Index, is the first dataset I'm going to be using. It is sourced from Human Development Reports, and can be accessed at the link below. (**https://www.kaggle.com/datasets/iamsouravbanerjee/women-empowerment-index**). It contains information on the WEI scores for 114 countries. 
- 2. The second dataset I'm interested in using is the *led* dataset, which stands for Life Exptectancy Dataset, and is sourced from the World Health Organization. This dataset can be accessed at the link below. (**https://www.kaggle.com/datasets/augustus0498/life-expectancy-who/data**). This dataset contains important health-centered data on every country in the world.

I'd like to join the two datasets by the common variable "Country", and analyze WEI scores by health factors related to the patient's country. I am interested in creating several maps that will visualize WEI scores against other health-based factors. 

# Cleaning the Data
I want to start by filtering *led* to only include data where the year = 2015, because this is the most recent year within this dataset.
```{r, class.source = "fold-show"}

led_2015 <- 	led%>%
	filter(Year=="2015")

```

Now, I want to rename several datapoints within the *led_2015* dataset because I want the data names to be congruent between *led_2015* and *wei*. This will be very tedious, but it is necessary to be able to join the datasets! 
```{r}

led_2015[led_2015$Country == "Bolivia(PlurinationalStateof)", "Country"] <- "Bolivia"

wei[wei$Country == "Bolivia(Plurinational State of)", "Country"] <- "Bolivia"

led_2015[led_2015$Country == "BosniaandHerzegovina", "Country"] <- "Bosnia and Herzegovina"

led_2015[led_2015$Country == "BurkinaFaso", "Country"] <- "Burkina Faso"

wei[wei$Country == "Congo (Democratic Republic of the)", "Country"] <- "DemocraticRepublicoftheCongo"

led_2015[led_2015$Country == "CostaRica", "Country"] <- "Costa Rica"

led_2015[led_2015$Country == "DominicanRepublic", "Country"] <- "Dominican Republic"

led_2015[led_2015$Country == "ElSalvador", "Country"] <- "El Salvador"

led_2015[led_2015$Country == "Iran(IslamicRepublicof)", "Country"] <- "Iran (Islamic Republic of)"

led_2015[led_2015$Country == "LaoPeople'sDemocraticRepublic", "Country"] <- "Laos"

wei[wei$Country == "Lao People's Democratic Republic", "Country"] <- "Laos"

led_2015[led_2015$Country == "RepublicofMoldova", "Country"] <- "Moldova (Republic of)"

led_2015[led_2015$Country == "SierraLeone", "Country"] <- "Sierra Leone"

led_2015[led_2015$Country == "SouthAfrica", "Country"] <- "South Africa"

led_2015[led_2015$Country == "SriLanka", "Country"] <- "Sri Lanka"

led_2015[led_2015$Country == "TheformerYugoslavrepublicofMacedonia", "Country"] <- "North Macedonia"

led_2015[led_2015$Country == "UnitedRepublicofTanzania", "Country"] <- "Tanzania (United Republic of)"

wei[wei$Country == "Türkiye", "Country"] <- "Turkey"

led_2015[led_2015$Country == "UnitedArabEmirates", "Country"] <- "United Arab Emirates"

led_2015[led_2015$Country == "UnitedKingdomofGreatBritainandNorthernIreland", "Country"] <- "United Kingdom"

led_2015[led_2015$Country == "UnitedStatesofAmerica", "Country"] <- "United States"

led_2015[led_2015$Country == "VietNam", "Country"] <- "Viet Nam"

```
That took a WHILE! But, now my two datasets should be better matched, and it'll be much easier to join them! 

Now I'm going to remove some unnecessary variables from the datasets This is because some variables have too many missing datapoints to be useful, or because some variables may be redundant. 
```{r, class.source = "fold-show"}

led_2015 = subset(led_2015, select = -c(Alcohol, Totalexpenditure,Year))

```

Now, I will be joining the two datasets using inner_join, because I only want the countries in common between both dataframes.  I'm also going to create a second version that is full_joined, just in case I end up needing it!
```{r, class.source = "fold-show"}
dataset <- wei%>% inner_join(led_2015,by="Country")

dataset2 <- wei%>% full_join(led_2015,by="Country")
dim(dataset)
```

Finally, I will be renaming a few variables because I dislike having spaces in my variable names. After this, I will have finished cleaning the data! I'm also removing a few more variables just because they seem rather redundant - for example, the Gender Parity Group and Gender Parity Index variables may be too similar to WEI, therefore they might have skewed levels of correlation.
```{r, class.source = "fold-show"}
dataset <- dataset %>%
  rename(WEI = "Women's Empowerment Index (WEI) - 2022")

dataset <- dataset %>%
  rename(WEG = "Women's Empowerment Group - 2022")

dataset <- dataset %>%
  rename(GGPI = "Global Gender Parity Index (GGPI) - 2022")

dataset <- dataset %>%
  rename(HDG = "Human Development Group - 2021")

dataset <- dataset %>%
  rename(GPG = "Gender Parity Group - 2022")

dataset <- dataset %>%
  rename(SDG_Region = "Sustainable Development Goal regions")
```

```{r, class.source = "fold-show"}
dataset = subset(dataset, select = -c(GPG, SDG_Region,GGPI))
```
All finished cleaning! Now I want to take a peek at the dataset I'll be working with before I go into the next steps. 
```{r, class.source = "fold-show"}
dim(dataset)
summary(dataset$WEI)
tally(dataset$Status)
```
# Intial Visualization 
Now, I want to create a map of WEI by country, just to get a general visual of how the distribution looks. 

I'm going to start out by creating a joined dataset between *dataset* and *World*, so I can have a mappable dataset using the **tmap** package. 
```{r, class.source = "fold-show"}
library(tmap)
data("World")

dataset<- dataset %>%
  rename(name = Country)

wei.map <- inner_join(dataset,World,by="name")

```
Unfortunately, the *wei.map* only contains 98 individuals despite *dataset* containing 112 individuals! This means that 14 individuals have disharmonious names between *dataset* and *World*. I'm going to run some code renaming a whole bunch of countries, just like I did above with the *led_2015* dataset. I'm going to hide this code, just because it's super tedious work. Tidying & cleaning your data is a constant process!
```{r}
dataset[dataset$name == "Bosnia and Herzegovina", "name"] <- "Bosnia and Herz."

dataset[dataset$name == "DemocraticRepublicoftheCongo", "name"] <- "Dem. Rep. Congo"

dataset[dataset$name == "Dominican Republic", "name"] <- "Dominican Rep."

dataset[dataset$name == "Iran (Islamic Republic of)", "name"] <- "Iran"

dataset[dataset$name == "Laos", "name"] <- "Lao PDR"

dataset[dataset$name == "North Macedonia", "name"] <- "Macedonia"

dataset[dataset$name == "Moldova (Republic of)", "name"] <- "Moldova"

dataset[dataset$name == "Tanzania (United Republic of)", "name"] <- "Tanzania"

dataset[dataset$name == "Viet Nam", "name"] <- "Vietnam"
```
Now, let's try re-joining these datasets!
```{r, class.source = "fold-show",message=FALSE}
wei.map <- right_join(dataset,World,by="name")
library(s2)
library(sf)
wei.map.sf  <- wei.map %>% 
  st_as_sf()
```
And visualizing:
```{r, message=FALSE}

tmap_mode("plot")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

```
Lovely! I also made an interactive version of the map, just for further exploration of the tmap package.

```{r, message=FALSE}

tmap_mode("view")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

```
Now, I want to move onto creating models that will help me predict the WEI scores of countries based on other factors.

# Decision Tree
I'm going to start by creating a decision tree for my *dataset*. I believe this will be beneficial to see, seeing as the decision tree will select the best predictors for my response variable. The first step here is to split my data into training and testing data.
```{r, class.source = "fold-show",message=FALSE}

set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
```
I'm going to look at the proportion of developed vs developing countries within both my *test* and *train* data. This is so I can ensure that *test* is at least somewhat representative of the distribution within *train*.
```{r, class.source = "fold-show",message=FALSE}
tally(test$Status, format='proportion')
tally(train$Status, format = 'proportion')
tally(dataset$Status, format = 'proportion')
```
Both *test* and *train* appear to be adequately representative of *dataset*!

Now, I will create my decision tree. I will fit the decision tree to *train*, and I will set nearly every predictor variable (BESIDES NAME & WEG, for reasons of redundancy) as my X. I am doing this so that the decision tree can tell me which predictor variables have the highest influence on my response variable, WEI.
```{r, class.source = "fold-show",message=FALSE}
tree <- rpart(WEI~ HDG + Status + Lifeexpectancy + AdultMortality + infantdeaths + HepatitisB + Measles +  Polio + Diphtheria + Population + Schooling, data=train, cp=.01)

rpart.plot(tree)
```

It appears there are three predictors that have the highest correlation with WEI scores : HDG (Human Development Group), Years of Schooling, and Adult Mortality Rates. For further exploration, I'm going to create a Random Forest to find our best model.

# Random Forest & RMSE
I'm going to have to not use any variables that have "NA" values to be able to properly crearte a random forest. This unfortunately means removing Schooling. However, I don't want to remove Schooling because according to the decision tree, it's one of the most important predictors! So, I am going to use regression imputation to create some estimated Schooling values for any of our countries that have Schooling = NA. 
```{r, class.source = "fold-show", message=FALSE}
missing <- is.na(dataset$Schooling) 

sum(missing)

which(missing)

lm(Schooling~WEI, data=dataset)


dataset$Schooling[11]<-2.574+17.943*(0.707)
dataset$Schooling[30]<-2.574+17.943*(0.778)
dataset$Schooling[31]<-2.574+17.943*(0.752)
dataset$Schooling[47]<-2.574+17.943*(0.686)
dataset$Schooling[72]<-2.574+17.943*(0.399)
dataset$Schooling[105]<-2.574+17.943*(0.510)
```
Yet another example of never being done with cleaning the data! Now, I have to recreate my training and testing dataset without these missing values. 
```{r, class.source = "fold-show", message=FALSE}
set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
```

Then, I can finally create my Random Forest
```{r, class.source = "fold-show",message=FALSE}
set.seed(228)
rf_model <-randomForest(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria ,data=train, proximity=TRUE, ntree=1000)
rf_model

```
The mean of squared residuals is pretty close to 0, which means the model is good! However, I'm going to also try running a Random Forest model in which I only use the three best predictor variables as identified by my Decision Tree. I will then compare the two mean of squared residuals.
```{r, class.source = "fold-show",message=FALSE}
set.seed(228)
rf_model2 <-randomForest(WEI ~ HDG +  Schooling + AdultMortality  ,data=train, proximity=TRUE, ntree=1000)
rf_model2

```
The mean of squared residuals is ever so slightly higher for this model, and the Var explained is slightly lower. I'm going to use an RMSE test to compare the two models.


Now, I'm going to boost the two models. Before doing so, I am changing HDG's class to ordered, and Status's class to ordered, so they arecompatible with the boosted model.
```{r, class.source = "fold-show",message=FALSE}
train$HDG <- as.ordered(train$HDG)
train$Status <- as.ordered(train$Status)
boost1 <- gbm(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria, data=train,distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)
boost2 <- gbm(WEI~ Schooling + HDG + AdultMortality , data = train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)

```
Now I will evaluate the two models using predictions & RMSE.
```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
predictions1 <- predict(boost1,newdata=test)
rmse(actual=test$WEI, predicted=predictions1)

predictions2 <- predict(boost2, newdata = test)
	rmse(actual=test$WEI, predicted=predictions2)
```
Once again, the model with nearly very variable has proved to perform *slightly* better than the model with only 3 predictor variables. However, this difference is **so small**, and I think it is worth it to sacrifice this small difference for the sake of keeping our model SIMPLE! The model still performs very well. So, I'm deciding to keep the model that only contains the three best predictor variables! 

# Final Visualizations
```{r, message=FALSE,warning=FALSE}
wei.map <- right_join(dataset,World,by="name")

wei.map.sf  <- wei.map %>% 
  st_as_sf()
```
I'm going to create FOUR maps here: One exactly like my first visualization that shows WEI scores by country, and then three that shows the distribution of each of my predictor variables! I will display all of these maps side by side.
```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
tmap_mode("plot")

 tm_shape(wei.map.sf) + tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("HDG",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Human Development Groups", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("Schooling",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Years of Schooling", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("AdultMortality",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Adult Mortality per Population of 1000", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)
```

These are sort of confusing. For one, the HDG categories are out of order, which is very hard to read. Secondly, the adult mortality map should probably have the colors flipped, because lower adult mortality corresponds to *higher* WEI.

I'm going to fix them, and insert the png below.

```{r}
knitr::include_graphics("/Users/jasmineday/Downloads/WEIPLOT.jpg")
```

 And there we have our final visualization!  I hope you enjoyed reading through this project as much as I enjoyed my STAT228 class :)
 
# BONUS: Testing the Prediction Model
After finishing the first draft of my project, I figured I wanted to add a section in which I take a small handful of countries that are NOT represented in the *wei* dataset and use our model to predict the WEI scores for these nations. I will be using the countries Kazakhstan, New Zealand, Sudan, Haiti, and Ukraine. I will use *led_2015* to source the HDG, Schooling, and AdultMortality for each of these countries.

```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
finalmodel <- lm(WEI~HDG+Schooling+AdultMortality,data=dataset)
finalmodel
#Kazakhstan:
  #HDG=Very high
  #Schooling=15
  #AdultMortality=198
Kazakhstan_WEI <- .0186 + .0717 + .0282*15 + .0002*198
#New Zealand  
  #HDG=Very high
  #Schooling=19.2
  #AdultMortality=66
NZ_WEI <- .0186 + .0717 + .0282*19.2 + .0002*66
#Sudan
  #HDG=Low
  #Schooling=7.2
  #AdultMortality=225
Sudan_WEI <- .0186 - .0749 + .0282*7.2 + .0002*225
#Haiti
  #HDG=Medium
  #Schooling=9.1
  #AdultMortality=24
Haiti_WEI <- .0186 - .0202 + .0282*9.1 + .0002*24
#Ukraine
  #HDG=High
  #Schooling=15.3
  #AdultMortality=195
Ukraine_WEI <- .0186 + .0282*15.3 + .0002*195
```

Kazakhstan imputed/predicted WEI score = .553
New Zealand imputed/predicted WEI score = .645
Sudan imputed/predicted WEI score = .192
Haiti imputed/predicted WEI score = .260
Ukraine imputed/predicted WEI score = .489

These don't feel entirely accurate. For example, Sudan's imputed WEI score is just above that of Yemen's, which is the lowest in the world. However, it still can be beneficial to see where there may be potential flaws in using this linear model to predict WEI scores! WEI score data on *all* countries is not public, so I am unfortunately unable to check the actual accuracy of these predictions. 
